Project Overview
The 22-23 Satisfaction Survey was distributed in April of last year to determine ways of improving the employee experience at ResponsiveEd. Alan Calvillo was tasked with determining trends in the data and reporting his findings.
Initial Presentation of Findings
Alan Calvillo created a results dashboard and Google Slide presentation upon completion of his analysis. His analysis consisted of examining multiple-choice scores to determine any major trends, and drawing trends from the free-response portion if no major trends were found in the multiple-choice portion. “Major trends” were defined as survey questions that clearly and obviously received low scores across one or more demographic slices. The benchmark for determining that a question was “clearly and obviously” negatively rated was intuitively set at a 20% unfavorable rating, based on the “I am paid fairly” question, which never fell below 21% unfavorable across all demographic slices.
Under this criterion, only one question (I am paid fairly) was universally negatively rated; all others received mixed reviews across demographics. Because of this, focus was shifted to the free-response portion of the survey to draw notable trends. Trends in the free-response portion of the survey were determined by reading a sample of about 200 responses, where a topic was considered trending if it was mentioned in more than 5 responses within the sample. Recommendations to address trending topics were created by Alan Calvillo & Casey Morgan, using suggestions provided by the responses when available, and by formulating educated guesses on the best way to address trending topics when not. Using this simple sampling methodology, the following trends and recommendations were provided in the initial Google Slide Presentation.
Purpose of this Report
Upon reviewing the presentation with Casey Morgan & Pete Alonzo, the executive team tasked Alan Calvillo with providing more conclusive evidence for the recommendations provided in the presentation. Since the recommendations originally provided came from aggregating topics mentioned in the free-response portion of the survey, conclusive evidence would be provided via similar methodology. Specifically, this report aimed to classify every free response as falling into one (or more) of the 12 major topics used to draw the recommended action items in the original presentation. Due to the large amount of textual data from the survey, we attempted to use a machine learning model for text classification. The model was unsuccessful, and sample estimation was ultimately used to calculate conclusive results.
Findings of this Report
Proportional sample estimation & weight-adjusted sample estimation were used to estimate the percentage of employees that mention the trending topics provided in the original presentation. Doing this essentially ranks the topics by importance through the creation of a comparable metric in the form of a topic mention rate (the frequency at which each topic is mentioned in a given sample). We decided to narrow down definitive results to only the top 3 most frequently mentioned topics, which allowed us to provide multiple clear and concise action items (updated recommendations) for each of the 3 topics. These recommendations differ from the initial ones: the majority of the initial recommendations used survey responses to form an educated guess on the best solution, whereas the updated recommendations use frequently mentioned suggestions taken directly from survey responses. We believe that addressing the areas of concern mentioned by the largest number of people would have the biggest impact on our workforce as a whole.
The goal of the following report is to provide supplemental evidence for the recommended action items & solutions provided by Alan Calvillo & Casey Morgan in the 22-23 Satisfaction Survey Google Slide Presentation. Evidence in favor of our recommendations will be determined by aggregating survey responses that mention one or more of the following topics, where each topic represents an area of concern that our original recommendations aimed to address:
These topics aren’t listed in any particular order; the numbers assigned are included solely for the purpose of data manipulation. These topics were deemed areas of interest after reading through a sample of about 200 random responses, where each topic was mentioned more than 5 times throughout the sample. Having been given free rein and no particular direction for extrapolating trends from the data, we chose to focus on the free-response portion of the survey to draw recommended action items for two reasons:
The 22-23 satisfaction survey consisted of a multiple choice section made up of 39 Likert scale questions, and a free-response section consisting of 5 open-ended custom questions. Upon completion of the survey, Alan Calvillo was tasked with creating a dashboard to view the data graphically. He was then tasked with creating a presentation of findings and any recommended action items, which would be edited/curated by Casey Morgan.
Logically, one could look to the distribution of scores per question
and pick out the top \(x\) lowest rated
questions to use as a starting point. The reason we didn’t do that was
because only one of the 39 Likert scale questions was universally rated
unfavorably across all demographic segments (age, tenure, location,
etc.), which was the I am paid fairly question. In other
words, we didn’t source responses for the top \(x\) lowest ranked Likert scale questions,
because the order of them changes with each demographic segment (with
the exception of the I am paid fairly question). While we
could have calculated unfavorable percentage without breaking it down
into demographic groups, doing so would have nullified the 2 months
spent creating the dashboard to view granular differences in the first
place. Even so, we wouldn’t have been able to easily find & filter
open-ended responses from respondents who rated questions unfavorably,
because we were not given access to a full data set from Quantum, for the sake of
respondent anonymity. We would have had to manually search for responses
that linguistically “match” topics/sentiment of the negatively rated
Likert scale questions. We cover why something like this would prove
difficult to do in Section 2.1, and it’s
ultimately the reason we chose to aggregate the responses to generate
relevant topics instead of trying to match responses to the topics of
the Likert scale questions.
Additionally, using the free-response portion of the survey not only gave us definitive areas of interest, but also gave us ideas for specific solutions sourced directly from the people being surveyed. Now that we’ve highlighted the reasoning used to determine our areas of interest, let’s go over the contents of this report. This report consists of:
Let’s move on to the first section of this report, which consists of a detailed analysis of the survey responses as a whole.
Since it’s been a while since we’ve thought about the survey responses, I thought it would be beneficial to begin by summarizing the data as a whole by showcasing general information. This will include things such as the distribution of responses across brands/questions, and examining the lengths of the responses themselves.
The preliminary analysis essentially kills two birds with one stone by introducing the types of data visualizations to be used in the response categorization, as well as creating the sub data sets needed to categorize the responses. Let’s start by answering a couple basic questions:
How many total responses did we receive? Did some questions receive more responses than others?
As we can see from the graph above, not every question received the same number of responses, which is likely due to the order in which the questions were asked. Now that we’ve seen the difference in the number of responses per question, this leads us to our next question…
Can we see the difference in responses per brand?
In order to find this out, we’ll have to create a new count table
that sums the number of responses per brand. Although the task seems
simple enough, we first have to create a new dataframe that adds a brand
distinction, sum the responses per question for each brand, then sum the
question totals for each brand. Additionally, we’ll do some data
manipulation to shorten the original Location and
Question names, just in case we want to see specific
campuses or question breakdowns in later graphs.
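As a rough illustration, a count table like the one below could be assembled with a few dplyr verbs. This is only a sketch: the `responses` data frame, the Location/Question column names, and the brand-matching patterns are assumptions, not the exact objects used in this analysis.

```r
# Rough sketch of the brand-level count table. `responses`, the Location and
# Question column names, and the brand patterns are illustrative placeholders.
library(dplyr)

brand_counts <- responses %>%
  mutate(Brand = case_when(
    grepl("^Founders", Location) ~ "FCA",
    grepl("^Premier",  Location) ~ "PHS",
    TRUE                         ~ "Other"   # remaining brands mapped the same way
  )) %>%
  count(Brand, Question) %>%                 # responses per question, per brand
  tidyr::pivot_wider(names_from = Question, values_from = n, values_fill = 0) %>%
  mutate(`Total per Brand` = rowSums(across(where(is.numeric)))) %>%
  arrange(desc(`Total per Brand`))
```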
| Brand | START Question | CONTINUE Question | STOP Question | WHY Question | General Comments | Total per Brand |
|---|---|---|---|---|---|---|
| FCA | 509 | 449 | 432 | 341 | 182 | 1913 |
| PHS | 249 | 223 | 221 | 156 | 87 | 936 |
| N/A | 180 | 159 | 149 | 112 | 59 | 659 |
| ICS | 149 | 133 | 134 | 99 | 58 | 573 |
| HO | 138 | 134 | 115 | 78 | 56 | 521 |
| QCA | 135 | 125 | 114 | 71 | 49 | 494 |
| ISVA | 113 | 107 | 98 | 67 | 42 | 427 |
| BX | 11 | 11 | 11 | 6 | 3 | 42 |
| BT | 8 | 8 | 8 | 4 | 5 | 33 |
We can see from the table that Founders received the highest total responses by a large margin. While the table above is very informative, we can get a clearer picture of these differences by using a detailed bar chart.
Thanks to this graph, we can now clearly see both the difference in the number of responses per brand and the number of responses per question. For each question, Founders employees provided more than double the number of responses of the next highest brand (PHS). The third highest group (in all but the General Comments question) was those that selected Prefer not to Disclose, followed by Ignite, Home Office, Quest, iSVA, Blue-X, & Bright Thinker.
We can expand upon these response counts by examining them within the context of the data set as a whole, specifically by seeing the same data as percentages. Let’s start by answering a simple question,
Since we know that the “START” question received the most responses, do we know what percentage of total responses were for that question? What about the other 4?
Roughly 27% of all responses were answering the “START” question.
Looking at the percentage differences, we can see that the first 3 questions received roughly the same number of responses, with a \(<3\%\) drop-off between them. This is favorable for our analysis since the START, CONTINUE, & STOP questions are phrased toward receiving targeted/specific answers. The WHY & General Comments questions are phrased more for providing context to answers, or supporting answers of the first 3 questions. We can also produce the same stacked bar chart to see the percent break-down of responses by brand:
From this graph we can see that Founders makes up the largest share
of responses at just over 34%. Another important
feature to highlight is that N/A responses account for over
10% of the total, which is worth noting as it’s not an
insignificant amount. What are rather insignificant amounts are the
percentages of responses that came from Blue-X (0.76%)
& Bright Thinker (0.58%).
Another key metric we should revisit is the participation rate. Ideally, we would calculate the participation rate with the following formula: \[participation~rate=(\frac{total~respondents}{total~workforce})~*~100\]
The main issue with trying to accurately determine this percentage is
that we did not have access to a unique identifier for determining the
number of respondents, Quantum withheld this data for the purpose of
anonymity. Because of this, the only field we could use to determine
participation was Count of Responses, which showed the
count of responses per each question. This was used in the dashboard to
determine “Total Survey Responses” by dividing the
total number of responses by the total number of questions, but only for
the multiple-choice portion of the survey. \[75,752~individual~responses~/~39~questions~=~1,942.358~total~respondents\]
Quantum divided the reporting elements of the survey by multiple-choice
and free-response, and only included the
Count of Responses field in the multiple-choice reporting
elements. Luckily we can apply the same logic to the free-response data
in order to find \(total~participants\)
and use that figure to find \(participation~rate\). First we’ll divide
the number of total responses by the number of questions to get the
total number of respondents: \[5,598~responses~/~5~questions~=~1,119.6~total~respondents\]
Now that we have the number of total respondents, let’s determine the
number for \(total~workforce\). We know
that the survey was made available to every employee of the company
including substitutes. So we’ll take the average staff counts for all of
our corporate entities (RES, Blue Learning, & Bright Thinker) at the
time that the surveys were sent out, and sum them all
together.1 We can estimate our total 22-23
workforce population as: \[2,820~Total~RES~Staff~+~253~Substitutes~+~80~Blue~Learning~Staff~+~22~Bright~Thinker~Staff~=~3,175~Total~Workforce\]
Now let’s plug all these figures into our original formula to calculate
the participation rate: \[(\frac{1,119.6}{3,175})~*~100~=~35.26\]
This gives us a comment participation rate of 35%. It’s obvious that this metric represents an average participation rate and not an actual rate, which is largely due to the variability in the number of responses per each question. While there’s no way to get a complete actual rate, we can calculate the actual rate of participants who answered at least one question fairly easily.
Taking the values for the question with the greatest number of responses (which was the “START” question in all instances) from each brand and summing them all up gives a total respondent count of \(1,492\). So if we measured comment participation as the percentage of total workforce that answered at least 1 question, then our actual participation rate would be: \[(\frac{1,492}{3,175})~*~100~=~46.99\] A 47% participation rate is a major improvement over 35%. An added bonus is that the question that received the most answers happens to be one of the three most important ones to this analysis. Measuring participation rate this way could be considered a more accurate approach, as the questions that are less relevant to this analysis were the ones that received the fewest responses, and were thus drastically lowering the original participation rate by being included in the average calculation.
1 These figures were recorded in August of 2023 and are accurate active staff counts for May of 2023
Now that we have a good general sense of the variation in the amount
of responses, we can begin examining the quality of the
responses. Let’s begin by looking at the length of responses per each
question. This can be done by adding an answer_length field
to our original data set, and plotting the values in a box and
whisker plot. Before doing so however, we can briefly cover how
to interpret a box and whisker plot.
A box plot is used to display the 5 number summary of the data being explained, meaning that the 5 major elements of a box plot are the minimum, the first quartile (Q1), the median, the third quartile (Q3), and the maximum.
The last element of box plots that aren’t necessarily covered in a 5 number summary are outliers. Outliers are calculated using the Interquartile Range (IQR) where \(IQR~=~Q3~-~Q1\). The IQR is then used to calculate lower range outliers (observations that fall below \(Q1~-~1.5*IQR\)) and upper range outliers (observations that fall above \(Q3~+~1.5*IQR\)).
That being said, let’s run a 5 number summary to see the response length (in words) breakdown for each of our questions.
| Question | Min | Q1 | Median | Q3 | Max |
|---|---|---|---|---|---|
| CONTINUE Question | 1 | 5 | 9 | 17 | 276 |
| General Comments | 1 | 8 | 21 | 45 | 552 |
| START Question | 1 | 6 | 12 | 26 | 446 |
| STOP Question | 1 | 3 | 9 | 22 | 323 |
| WHY Question | 1 | 20 | 43 | 88 | 720 |
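To make the outlier rule concrete, applying it to the WHY Question row above gives: \[IQR = 88 - 20 = 68, \qquad upper~fence = 88 + 1.5 \times 68 = 190\] so any WHY response longer than 190 words counts as an upper-range outlier, while the lower fence (\(20 - 1.5 \times 68 = -82\)) is negative, meaning there are no lower-range outliers for response length.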
We can see from these summaries that the values for the first 3
questions appear to be smaller than the values for the
WHY Question & the General Comments
questions. This could indeed support our hypothesis that the first 3 questions will hold the majority of useful information in our analysis, and that the last two provide ancillary (but not necessary) data. We’ll explore this further in the second half, but for now
let’s plot the data in a box plot so we can see the data in a
visual context.
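A minimal sketch of how the answer_length field and the box plot could be produced is shown below; the `survey_data` object and the Question/Answer column names are placeholders for the actual data set.

```r
# Minimal sketch of the answer_length field and the box plot.
# `survey_data` and the Question/Answer column names are placeholders.
library(dplyr)
library(stringr)
library(ggplot2)

survey_lengths <- survey_data %>%
  mutate(answer_length = str_count(Answer, "\\S+"))   # word count per response

ggplot(survey_lengths, aes(x = Question, y = answer_length)) +
  geom_boxplot(fill = "white", outlier.alpha = 0.4) +
  labs(x = NULL, y = "Response length (words)") +
  coord_flip()
```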
Here we can see that the IQR (represented by the white boxes) is
largest in the WHY Question and the
General Comments question. We also see that the
WHY Question appears to have both the greatest number of outliers, and the most outliers greater than 200 words. This adds support to our theory discussed earlier. Let’s take a closer look by examining a random sample from the WHY Question pool of responses, and comparing it to a random sample from the START Question pool. Let’s start by looking at some responses to the START Question:
Just doing an initial comb through of this data, we can see that the majority of comments tend to be relatively short, without many lengthy personal grievances or anecdotes. That being said, this is just an initial observation; we’ll need to produce a random sample for the WHY Question as well.
This sample for the WHY question shows some lengthy
responses, especially when compared to the START question
sample. Let’s take a side-by-side look at the summary statistics of these samples versus their populations, to see if they look similar.
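For reference, a sample like this can be drawn and summarized in a couple of lines; the sample size of 100 and the Question label below are illustrative assumptions, not the exact values used here.

```r
# Sketch of drawing a random START sample and summarizing its lengths;
# the sample size and Question label are illustrative.
library(dplyr)

set.seed(2023)                        # reproducible draw
start_sample <- survey_lengths %>%
  filter(Question == "START Question") %>%
  slice_sample(n = 100)

summary(start_sample$answer_length)   # compare against the full START pool
```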
| Statistic (START Question) | Actual | Sample |
|---|---|---|
| Minimum | 1 | 1 |
| 1st Quartile (Q1) | 6 | 6 |
| Median | 12 | 12 |
| 3rd Quartile (Q3) | 26 | 25 |
| Maximum | 446 | 197 |
| Statistic (WHY Question) | Actual | Sample |
|---|---|---|
| Minimum | 1 | 1 |
| 1st Quartile (Q1) | 20 | 18 |
| Median | 43 | 38 |
| 3rd Quartile (Q3) | 88 | 80 |
| Maximum | 720 | 720 |
This looks very promising: the summary statistics are extremely close, especially considering that the populations of the START Question & WHY Question are \(1492\) & \(934\) responses respectively. Lastly,
let’s see one more side-by-side comparison of the populations
and samples as box plots to end the preliminary analysis.
Now that we’ve covered the basics around the comment data as a whole we can move onto the main purpose of this analysis, which is the estimation of topic representation among our survey responses. We’ll start by attempting to make a machine learning model that can accurately categorize responses into a predefined list (the recommendations given in the survey presentation) on our behalf.
Recall that the question that received the most responses, and likely
holds the most useful information, was the START Question.
Due to the tremendous amount of work it takes to train and tune a model
of this caliber, we’ll only be creating one model that analyzes the
responses to the question “Please explain what you would like your
organization to START doing to make it a better place to
work”.
The simplest way we could approach response categorization would be to manually sort, filter, and label responses using things like Excel functions to do partial matches (REGEXP, FILTER, etc.). The problem is that this approach is extremely impractical, especially because our data set consists of over 5,000 individual responses.
Additionally, even regular expressions would be hard to use due to the variability in the formatting and capitalization of words and responses throughout our data set. As an example, let’s filter our entire data set to view responses that appear more than once, to see how our data is aggregated by default.
Just as I suspected, there are a lot of
repeat answers that aren’t being grouped together just because of subtle
nuances in the way that they’re spelled or stylized. Take the term
N/A for example: it appears in our data set several times
as: na, Na, N/a, etc… This causes
issues when trying to calculate the exact number of times a specific
response was given.
Because formulas are case-sensitive by default, they don’t automatically recognize “na” & “N/A” as being the same response. While this isn’t necessarily a big issue, it definitely makes simple categorization using spreadsheet functions a lot harder and more time-consuming. This is something we’ll address again in later portions, but I wanted to bring it up before we begin model creation to provide reasoning behind using modeling instead of traditional data cleansing methods.
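To make the issue concrete, here is a small sketch showing how the N/A variants fail to group until they are standardized; the column name and toy data are illustrative.

```r
# Toy example of why the N/A variants don't group until they're standardized.
library(dplyr)
library(stringr)

responses <- tibble(answer = c("N/A", "na", "Na", "N/a", "Nothing"))

responses %>%
  mutate(standardized = str_to_lower(str_remove_all(answer, "[[:punct:]]"))) %>%
  count(standardized, sort = TRUE)
# the four N/A variants now collapse into a single "na" group (n = 4)
```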
This portion of the report covers an attempt at making a machine learning text classification model which was ultimately not used to determine employee sentiment. To skip to the portion of the report that ended up being used (sample estimation), click here
Now let’s begin analyzing the responses from the start_dataset. This part of our analysis involved a fair
bit of natural language processing (NLP), which is a field in data
analytics that can be used to process, analyze, and generate language
using statistics and machine learning.
The type of NLP we’re doing here is called Topic Modeling, and there are actually two methods we can use here. The first way is an aggregate analysis, which approaches the task of finding trends by analyzing each response in the context of the entire data set. This approach would be beneficial to do if we didn’t already have a predefined list of topics we wish to extract from the data, which leads us to our second method.
The second method is an individual analysis using text classification, which essentially summarizes each response down to their specific topic(s) in order to aggregate them into our predefined list of topics. This will be done by completing the following steps:
Step 1: Creating the training_subset
In order to complete step 1, we’ll have to create a training subset
from our start_dataset that labels each response as falling
into 1 or more of our predefined topics. This subset
will be used to train a machine learning model on what to look for in
the rest of the text data, in order to accurately assign it one or more
of the following predefined topics:
There’s no exact number of responses we need to manually label for our training data set; we just need to make sure we label enough to capture the variability in our responses, with a more or less even number of responses labeled as being a part of each of the 12 predefined topics. We’ll aim to include about 10% (\(149.2\)) of the total responses in our training set, where each topic is adequately represented. Since \(149\) responses is almost exactly 10%, we’ll try to get a roughly even number of responses for each topic (12 per topic). We can see our training set down below.
Representation isn’t exactly even across all of our topics, but that shouldn’t be an issue. The main area of focus in collecting this training sample was to include enough examples of each topic so that the algorithm has enough data to accurately sort the rest of the data set. Here’s a summary table showing the number of times each category was represented in the sampled responses. You may have noticed from the table above that some of these responses mention more than one topic. This explains why some of the topics have a lot more mentions than others: the total isn’t the number of responses that mention only that topic, it’s the number of times the topic was mentioned at all. There are some responses in our training set that mention 3 or 4 different topics, which is the reason we’re doing this analysis using a multi-label classification method.
| Variable | Total |
|---|---|
|  | 22 |
|  | 19 |
|  | 12 |
|  | 12 |
|  | 16 |
|  | 16 |
|  | 32 |
|  | 12 |
|  | 12 |
|  | 12 |
|  | 22 |
|  | 45 |
| # of topics mentioned | 232 |
Step 2: Preprocessing Our Data
The second step in our text classification process involves cleaning and preprocessing the text data. In Section 2.1, we covered how text data is aggregated by default, which causes issues when trying to create classification models. This is addressed here as part of the tokenization process, where we’ll standardize the data by removing capitalization and punctuation, which is crucial for enhancing the model’s ability to learn meaningful patterns.
In order to preprocess the data, we’ll use a process called tokenization, which is the process of splitting up our responses into a one-token-per-row format where a token represents a single word. Having the text in this format lets us manipulate, process, and visualize the text using a standard set of statistical tools in R.
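A minimal sketch of this step using tidytext is shown below, assuming the START responses live in start_dataset with an Answer column (the column name is an assumption).

```r
# Sketch of the tokenization step with tidytext; the Answer column name
# is an assumption about how start_dataset is structured.
library(dplyr)
library(tidytext)

start_tokens <- start_dataset %>%
  mutate(response_id = row_number()) %>%   # remember which response each token came from
  unnest_tokens(word, Answer)              # lowercases and strips punctuation by default
```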
The last step in preprocessing is the process of removing stop words. Stop words are words that are not useful for an analysis; they consist of extremely common words such as “the”, “of”, “to”, and so on. We can remove stop words using the stop_words dataset in the tidytext package, which contains stop words from 3 different lexicons. We can use them all together, or filter() down to a single set of stop words, which is what we’ll do here. I opted to only use the snowball lexicon of stop words because it appeared to be the most generalized. We can see the list of words that will be removed during the process of tokenization below:
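Continuing the sketch from the previous step, the snowball stop words can be filtered out of the tokenized data with an anti_join (object names continue the illustrative naming used above).

```r
# Keep only the snowball lexicon and drop those words from the tokenized responses.
library(dplyr)
library(tidytext)

snowball_stops <- stop_words %>%
  filter(lexicon == "snowball")

start_tokens_clean <- start_tokens %>%
  anti_join(snowball_stops, by = "word")
```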
Step 3: Feature Extraction (TF-IDF)
Now that our data has been cleaned and pre-processed, we can move onto the next step, which is feature extraction via TF-IDF, or Term Frequency-Inverse Document Frequency. TF-IDF is a statistical measure used to evaluate the importance of specific words to each document (response, in our case) in a corpus (collection) of documents (responses). Since each response can be considered its own document for the purpose of our analysis, TF-IDF can be used to assign weights of importance to each word. It is calculated by multiplying a term’s frequency within a response by the logarithm of its inverse document frequency (how rarely the term appears across all responses), which gives each term a single numerical weight in a given corpus, thus making aggregation (trend analysis) faster and more efficient for machine learning models. Let’s go over a simple example using the conditions of our analysis to better understand exactly how TF-IDF is calculated:
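As an illustration (with made-up numbers, not figures from our data): suppose the word “pay” appears 3 times in a 30-word response, and appears in 100 of the 1,000 responses in the corpus. Then: \[TF = \frac{3}{30} = 0.1, \qquad IDF = \ln\left(\frac{1{,}000}{100}\right) \approx 2.303, \qquad TF\text{-}IDF = 0.1 \times 2.303 \approx 0.23\] A word like “school” that shows up in nearly every response would get an IDF close to \(\ln(1) = 0\), and therefore a TF-IDF near zero no matter how often it is used.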
Now that we have a better idea of what TF-IDF is and how it’s calculated, we can use word clouds to see a visual difference between TF-IDF and term frequency using the data in our survey responses. Through doing so, we’ll cover the reasons why using TF-IDF in NLP is preferred over term frequency.
The word cloud above represents the top 50 most frequently used words found in our data set. These don’t really come as a surprise given the context of our survey population being mostly teachers. This is essentially the first half of our TF-IDF statistic (term frequency), next we’ll look at a word cloud that uses TF-IDF instead of just term frequency.
Recall that the IDF in TF-IDF stands for inverse document frequency. This means that the term frequency for each term in our data set is adjusted by how many responses that term appears in, thus down-weighting words like students, school, or like because they’re found throughout the corpus. Words that appear in fewer responses avoid such a harsh down-weight adjustment and are effectively up-weighted. This explains why words like custodian and communication appear larger: they may not have as high of a term frequency as other words, but they have a higher total TF-IDF figure. Now that our data is in a numerical format suitable for modeling, we can proceed to the next step.
Step 4: Modeling Our Training Data
Modeling text data often relies on machine learning models in lieu of traditional regression models, thanks to the increase in available computational power. Machine learning models are better suited to processing matrices and vectors, which provide more opportunities for mapping relationships when compared to regression models. The exact type of model we’ll be using here is called a Binary Relevance SVM Classification Model. Binary relevance is essentially a one-vs-all strategy, where each topic is represented by a binary label (recall Table 2.1), and the presence or absence of that topic is the predicted result. This method tests each one of the 12 topics separately, thus creating 12 unique models.
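Conceptually, the binary relevance set-up looks something like the sketch below: one SVM per topic, each predicting presence or absence. The object names (train_tfidf, train_labels, holdout_tfidf) are placeholders, and e1071 is just one common R package for SVMs, not necessarily the one used in this analysis.

```r
# Hedged sketch of the binary relevance set-up described above.
# `train_tfidf` is a responses-by-terms TF-IDF matrix; `train_labels` is a
# 0/1 matrix with one column per topic. Both are illustrative placeholders.
library(e1071)

topics <- colnames(train_labels)

# Fit one SVM per topic, each predicting presence (1) or absence (0) of that topic
models <- lapply(topics, function(topic) {
  svm(
    x = train_tfidf,
    y = factor(train_labels[, topic]),
    kernel = "linear",
    scale  = FALSE            # TF-IDF features are already on a comparable scale
  )
})
names(models) <- topics

# Each model makes its own independent prediction on the holdout matrix
holdout_pred <- lapply(models, function(m) predict(m, holdout_tfidf))
```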
To evaluate the accuracy of our models, we can use a holdout set of data to calculate performance metrics. A holdout set is essentially just a subset of data that is not at all present in the training set. We’ll have to prepare the holdout data in the same way as we did the training set: that means manually reading and labeling each comment, adding an observation number, stripping capitalization & punctuation, and converting the data into a TF-IDF matrix.
While this sounds like a lot of work (it is), it could’ve actually been a lot more. Had I not already created a function in Google Sheets that automates the task of exporting random rows, finding data that wasn’t already a part of my training set could have been rather complicated. We also saved time in creating a holdout set by having evenly balanced the distribution of topic representation in our training_set. Recall that we purposely over-sampled under-represented topics and under-sampled over-represented ones from the initial random sample used to create the training set (to ensure the models had enough data to train on); had we not, we would have had to make sure the topic distribution of our holdout_set matched the topic distribution of our training set. This just goes to show how nuanced and complicated modeling text data can be: each step is essentially 5 steps, where messing up one can ruin the entire thing.
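For reference, the metrics reported in the table below can be computed from a simple confusion-matrix tally per topic; this is a sketch with illustrative object names, following the standard definitions discussed just after the table.

```r
# Sketch of the holdout evaluation for a single topic, given 0/1 vectors of
# actual and predicted labels (object names are illustrative).
score_model <- function(actual, predicted) {
  tp <- sum(predicted == 1 & actual == 1)   # true positives
  tn <- sum(predicted == 0 & actual == 0)   # true negatives
  fp <- sum(predicted == 1 & actual == 0)   # false positives
  fn <- sum(predicted == 0 & actual == 1)   # false negatives

  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)

  c(
    Accuracy  = (tp + tn) / (tp + tn + fp + fn),
    Precision = precision,
    Recall    = recall,
    F1        = 2 * precision * recall / (precision + recall)
  )
}
```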
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| 1 | 0.7171717 | 0.0985915 | 0.2592593 | 0.1428571 |
| 2 | 0.8956229 | 0.0952381 | 0.1428571 | 0.1142857 |
| 3 | 0.9326599 | 0.0000000 | 0.0000000 | NA |
| 4 | 0.9191919 | 0.0000000 | 0.0000000 | NA |
| 5 | 0.9057239 | 0.1000000 | 0.0500000 | 0.0666667 |
| 6 | 0.9326599 | 0.0000000 | 0.0000000 | NA |
| 7 | 0.7744108 | 0.1052632 | 0.1081081 | 0.1066667 |
| 8 | 0.9259259 | 0.0000000 | 0.0000000 | NA |
| 9 | 0.9158249 | 0.0000000 | 0.0000000 | NA |
| 10 | 0.7306397 | 0.0000000 | 0.0000000 | NA |
| 11 | 0.7811448 | 0.0370370 | 0.1333333 | 0.0579710 |
| 12 | 0.7609428 | 0.1470588 | 0.1063830 | 0.1234568 |
To put it bluntly, these scores are not good. Accuracy is calculated as \(\frac{TP+TN}{TP+TN+FP+FN}\), where \(TP=true~positives\), \(TN=true~negatives\), \(FP=false~positives\), and \(FN=false~negatives\). It calculates the proportion of true results among the total number of cases, where 100% accuracy is the max. While some of the models have up to 93% accuracy, this metric can be misleading, since a model could be considered accurate simply by not identifying anything and seeing a lot of true negatives. This metric essentially measures the number of times a model isn’t wrong, but just because it isn’t wrong doesn’t mean it’s right.
Precision is the ratio of correctly predicted true positives to total positives predicted (TP & FP); it’s calculated as \(\frac{TP}{TP+FP}\). To give an example, Model 1 has a precision score of \(0.0985915\), meaning that when it predicted a positive outcome, it was correct 9.86% of the time. None of the models had a precision score higher than 15%, suggesting a high number of false positives.
Recall is similar to precision except it measures
the ratio of correctly predicted true positives to
all the actual true positives; it’s calculated as \(\frac{TP}{TP+FN}\). Again looking at Model
1, it has a recall score of \(0.2592593\), meaning it correctly
identified 25.93% of all the actual true positives in
the holdout_set. None of the other models even came close
to Model 1’s recall score, suggesting a very high number of false
negatives.
F1 Scores are the harmonic mean of Precision and Recall, but since none of the models had particularly good scores for either, the F1 scores don’t tell us much more.
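As a quick sanity check, Model 1’s F1 score in the table above can be reproduced from its precision and recall: \[F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall} = \frac{2(0.0986)(0.2593)}{0.0986 + 0.2593} \approx 0.143\]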
There are many reasons why the models could have under-performed, which we’ll touch on in a second. Let’s first take a closer look at how the models would have classified each response by creating a table with their predictions, and see it side-by-side to our actual topic classifications.
We can see just how inaccurate the model was, failing to capture the topics in almost all of our responses. Unfortunately this is pretty common for an initial model, as machine learning techniques are extremely complicated and require a lot of tuning and improvement before they perform well. There are a multitude of things we can do to try to improve these results, with the simpler steps being re-balancing the ratios of topic representation and enhancing feature extraction by exploring different data transformations such as n-grams. More complex options include trying different classification methods entirely, such as random forests, gradient boosting, and even deep learning techniques via neural networks.
While I would love to continue experimentation and model improvement, a large amount of time has already been dedicated to the creation of the training set and holdout set for this model, and tuning the model would likely take even more time than is warranted for the goal of our analysis. All is not lost however: while the goal of creating an accurate model to automate response classification wasn’t fully achieved, I learned a lot in the process of creating this first version. I can confidently say that with enough time, an accurate classification model can be made to summarize and classify textual data in a future analysis.
So what now? Well, while manually classifying responses for the training set and holdout set to be used in the text classification model, I had the following thought:
I’ve already had to read a lot of responses for the creation of these subsets, couldn’t I just use these as a sample to estimate the rate at which each topic is mentioned in the population?
The short answer is yes, I can. It would require some additional manipulation to the sample if I wish to be as accurate as possible, but after doing that, the rest of the analysis should be relatively straight-forward. To accomplish this, we would have to:
Step 1: Optimizing the Holdout Set for Proportional Estimation
You may have noticed the term “proportional” being included when describing the sample we wish to generate here, and it’s for good reason. This means that our sample’s composition should mirror the demographic distribution of the total population we’re examining. Since the only demographic we have access to is location, each location in our sample must be included in the same proportion as it exists in the overall population, regardless of sample size. Here’s a simple example that determines the proportion of responses from FCA Conroe in a 21% sample size:
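As a purely hypothetical illustration (the FCA Conroe figure here is made up): if FCA Conroe accounted for 60 of the \(1,492\) START responses, that’s roughly 4% of the population, so a 21% sample (\(0.21 \times 1,492 \approx 313\) responses) should contain about \(0.04 \times 313 \approx 13\) responses from FCA Conroe.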
Now that we understand the importance of gathering a sample that
mirrors the proportions of the population, let’s begin planning what
further manipulation we’ll have to do to our labeled data to make it
into a usable sample. Recall that in Section
2.2; Step 1, for the creation of the training_set, we
deliberately over and under-represented certain topics to get equal
representation across a 10% sample size. Since the data
labeled in this subset was not entirely random, it won’t
accurately reflect the general sentiment of the population,
meaning it’s not usable for proportional sample estimation.
That leaves us with only the responses categorized for the
holdout_set, which was randomly generated
for the purpose of model testing, and consists of a 20%
sample size. While the data from the holdout_set is usable, we’ll still
have to add and remove responses based on their location, in order to
achieve a proportional sample. To do this, we’ll create a table that
shows:
Now that we have a clear answer on which schools are over and
under-represented, we’ll add and remove responses from each location so
that they are proportionally represented. This will be done by using a
random import function in Google Sheets for responses that need to be
added, and a random name picker for responses that need to be removed.
We’ll name this new subset proportional_sample and move
onto the next step.
Step 2: Calculating Topic Representation in the proportional_sample
Now we’ll begin calculating each topic’s mention rate in our sample.
Let’s start by creating a table similar to Table
2.2, that shows a summary of the number of times each topic was
mentioned in our sample. In addition to all the predefined topics, this table will also include
2 new topics (13. No Comment &
14. Satisfied with organization). These new topics will
hopefully allow us to get a better idea of the number of responses that
fail to mention any of the 12 predefined topics.
Unique to this table will be the inclusion of lower-bound and upper-bound confidence interval statistics, or CI for short.
Let’s briefly cover what confidence intervals are and why they are being included here. In statistics, characteristics of a population are often estimated from a sample. While these estimations can be insightful, due to random chance they are likely not 100% accurate reflections of the true characteristics of the population. The following image further explains this concept:
Essentially, confidence intervals can be used to estimate a range in which the true characteristic (in our case, rate of topic representation) is likely to be found, with a certain probability (i.e. confidence level). Confidence levels can be adjusted to whatever range you find most appropriate for your analysis, but a 95% confidence level is pretty standard in statistics, which is what we’ll use here. How exactly are the upper and lower range confidence intervals actually calculated? The following image shows the formula and a visualization of a normal distribution:
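For reference, the standard two-sided interval for an estimated proportion \(\hat{p}\) from a sample of size \(n\) takes the form \[\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\] where \(z_{\alpha/2} \approx 1.96\) at the 95% confidence level. The intervals reported in the tables below appear consistent with the closely related Wilson score interval (what prop.test() returns in R when the continuity correction is turned off), which behaves better for small proportions.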
Now that we understand how to interpret confidence intervals, let’s take a look at the summary table to see the breakdown of topics found in our proportional sample from most to least mentions.
| Variable | Total | Proportion | LowerCI | UpperCI |
|---|---|---|---|---|
|  | 41 | 13.67% | 10.24% | 18.02% |
|  | 40 | 13.33% | 9.95% | 17.65% |
|  | 26 | 8.67% | 5.98% | 12.40% |
|  | 19 | 6.33% | 4.09% | 9.68% |
|  | 19 | 6.33% | 4.09% | 9.68% |
|  | 18 | 6.00% | 3.83% | 9.28% |
|  | 17 | 5.67% | 3.57% | 8.89% |
|  | 16 | 5.33% | 3.31% | 8.49% |
|  | 15 | 5.00% | 3.05% | 8.08% |
|  | 12 | 4.00% | 2.30% | 6.86% |
|  | 10 | 3.33% | 1.82% | 6.03% |
|  | 5 | 1.67% | 0.71% | 3.84% |
|  | 3 | 1.00% | 0.34% | 2.90% |
|  | 2 | 0.67% | 0.18% | 2.40% |
Now let’s use these mention rates to draw inferences on the entire population in the 3rd and final step.
Step 3: Drawing Population Inferences From the Sample
Let’s begin by estimating the number of staff who identify with each topic. Recall from section 1.3 that the true population size at the time that the survey was distributed was \(3,175\) employees. Using the Proportion, Lower-bound CI, & Upper-bound CI percentages from Table 3.1, we’ll create a table showing the estimated number of total workforce that sympathize with each topic.
| Variable | Lower Estimate | Estimated Total | Upper Estimate |
|---|---|---|---|
|  | 325 | 434 | 572 |
|  | 316 | 423 | 560 |
|  | 190 | 275 | 394 |
|  | 130 | 201 | 307 |
|  | 130 | 201 | 307 |
|  | 122 | 190 | 295 |
|  | 113 | 180 | 282 |
|  | 105 | 169 | 270 |
|  | 97 | 159 | 257 |
|  | 73 | 127 | 218 |
|  | 58 | 106 | 191 |
|  | 23 | 53 | 122 |
|  | 11 | 32 | 92 |
|  | 6 | 21 | 76 |
While we could stop here and use these population estimates to draw conclusions, there are a few ways to manipulate our sampling method and data pre-processing, in order to give more definitive results. Doing so would address potential biases that could be skewing our results one way or another.
According to our survey sample, we can say with 95% confidence that somewhere between roughly 10% and 18% of our total workforce population (a point estimate of roughly \(430\) employees) are dissatisfied with their compensation. While that doesn’t seem like a lot, it’s important to contextualize these figures within the broader scope of the survey itself. What we mean by this is that there are a number of potential biases that could be causing under-representation of the true sentiment of our workforce population.
Self-Selection Bias
One of the broader factors that could be skewing our results is the potential for self-selection bias. This occurs when the individuals who choose to respond are not a random sample but are self-selected participants who tend to have strong opinions. Whether negative or positive, these participants are more likely to take the time to respond to surveys or provide feedback, leaving out the moderate or indifferent voices.
It’s reasonable to assume that under the conditions of this bias, the majority of the unobserved population (the moderate/indifferent) are more likely to agree (to a degree) with the sentiment of the self-selected participants than not agree. Since we measured free-response participation rate of this survey as being 47%, that would mean that roughly 53% of our total workforce could be considered “disengaged”. One could argue that the sentiment of the disengaged should be accounted for in one way or another, since being disengaged could be considered on-par with being dissatisfied.
Introducing Non-random Sampling Bias vs Dealing with Underestimation of Topic Representation
Another major factor that could be skewing our results also revolves around disengagement, specifically the large amount of “unhelpful” responses first covered in Section 2.1. While unhelpful responses (“no comment”) only made up about 6% of our sample, that’s 6% less useful data that could have been used to enhance the precision of topic representation. Utilizing NLP techniques introduced in Section 2.2, we can actually find and exclude certain responses deemed “unhelpful” relatively easily.
The only issue with removing unhelpful responses is the potential to introduce non-random sampling bias. This can occur if the unhelpful responses are not uniformly distributed across our population, especially if certain segments of the population are more likely to give unhelpful responses than others. This could of course be accounted for by calculating response-drop ratios for each location, and seeing if they’re uniformly distributed, but this process would be cumbersome and inefficient.
Addressing Non-random Location Sampling Bias
The last major factor possibly skewing our results pertains to
potential demographic imbalances being caused by the
Prefer Not to Disclose location option. Recall that in Step 1 we had to calculate the
proportions of random responses needed from each location, in
order to collect a proportionally accurate sample of our population. But
is FCA Conroe truly proportionally represented if half of their staff
selected the Prefer Not to Disclose option? As is, the
sample lacks granularity to examine location specific sentiment due to
the potential for bias. For example, if respondents who feel negatively
about certain issues are more inclined to hide their location, the
topics regarding location specific issues (Topic 1, 9, & 10) are
likely skewed.
Is There a Solution?
We’ve covered how the unobserved population makes up roughly
53% of our total workforce, and how they are more
likely to hold the same sentiment as the people who
actually answered than not. We’ve also covered how unhelpful responses
could be down-weighting topic sentiment, along with the possibility of
introducing non-random sampling bias by removing unhelpful
responses if they’re not uniformly distributed
across all the locations. Lastly, we went over the potential non-random
sampling bias that could already be present in our population via the
Prefer Not to Disclose location option.
All that being said, is there a way to account for the large disengaged portion of our population, while increasing the importance of helpful answers, and eliminating any potential non-random sampling bias?
As a matter of fact, yes. We can do this by:
Let’s explain what we mean by
"using total workforce... instead of response population".
In Section 3.0 we calculated the N sample size as a percentage of the
total responses. Since we’re omitting roughly
17% of the original data set as part of step 1 here, it
doesn’t really make sense to use the same 20% of total
responses method we used earlier, since we’d essentially be
sampling a sample. If we did calculate the sample in the same way
post-omission, our sample would actually decrease in size since omitting
responses would shrink the size of the entire dataset. Instead, it makes
more sense to use the true population (3,175) to calculate sample size, especially
since it also helps solve potential non-random sampling bias that could
be introduced via omission of unusable data.
Likewise, if data were not uniformly omitted across all locations,
calculating a PSC in the same way we did before could result in
inaccurate proportions and negatively affect the final weight
adjustments. This means that PSC will be determined by the
proportion of each location’s staff count to the
total staff count (percentage-of-whole), and
not as the proportion of responses from each location
to all responses. Using the true population for calculating proportional
sample contributions is essential, especially since we’re deliberately
trying to account for the missing population via omission of unusable
data and response weighting. 10% of our true population
totals out to about \(318\) individual
responses and is fairly close to the 20% sample size
used for the holdout_set (\(298\)).
Now that we understand steps 1 & 2, what do we mean by up-weighting exactly? For each location, we’ll calculate a response-weight adjustment that will be applied to each response before it’s recorded as a topic mention. The response-weight adjustment will be calculated by dividing each location’s adjusted 22-23 staff count by its number of remaining responses (post-omissions). We’ll cover why we’re using an adjusted staff count in lieu of an actual staff count later, but for now let’s work through an example using the same scenario from Section 3.0: Step 1 to better understand the steps involved with this refined sampling method:
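As a hypothetical illustration (these numbers are made up): if a campus has an adjusted 22-23 staff count of 90 and 30 of its responses remain after omissions, then \[weight~adjustment = \frac{90~adjusted~staff}{30~remaining~responses} = 3\] so each of that campus’s responses would count 3 times toward the topic totals, mirroring the weight of \(3\) applied to FCA Austin later in this section.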
Now that we’ve outlined the steps involved with refining our sampling methodology, let’s start making the necessary changes to our population and sample set.
Recall the introduction to NLP preprocessing in Section 2.2: Step 2, here we first
introduced the concept of stripping capitalization & punctuation
from text in order to aggregate responses easier. In that section, this
was done automatically as part of the tokenization process. Here, we’ll
have to use specific functions to do this, in order to more easily find
and remove unhelpful answers from our start_dataset. While
it is cumbersome and time consuming as we mentioned in Section 2.1, we’re only doing this for
eliminating one particular type of response. Doing so for manual
classification of all 12 of our topics would have been a different
story.
Step 1: Removing Unusable Data From the Population
Let’s start by creating a table, adding a new standardized answer field, and aggregating any repeat answers.
This table helps us tremendously, we’re able to see a handful of
responses that we could easily remove such as na and
nothing, but I’m certain that there are more responses than
just the repeat ones we see above that would classify as
“unhelpful”.
We can get a filtered view to see specific responses that match a
criteria using regular expressions. Specifically, we’ll look through
responses that start with I or N. The reason
being that most unhelpful responses are likely going to start with I
statements, or sentences that begin with negative sentence fragments
(not sure, I... or nothing, I...). For the
sake of not being repetitive, we’ll do this process of combing through
and removing responses independent from this report. The following table
will be the final version of the START dataset
post-data-cleansing.
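A sketch of the kind of regular-expression pass used to surface candidate responses for that manual review is shown below; the Answer column name and the pattern are illustrative.

```r
# Sketch of surfacing candidate "unhelpful" responses for manual review;
# the Answer column name and the pattern are illustrative placeholders.
library(dplyr)
library(stringr)

candidates <- start_dataset %>%
  mutate(standardized = str_squish(str_to_lower(Answer))) %>%
  filter(str_detect(standardized, "^(i|n)"))   # responses starting with "I" or "N"
# e.g. "na", "nothing", "not sure, i...", "i have nothing to add" all surface here
```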
It looks like removing unhelpful responses decreased the number of
total responses from \(1492\) to \(1410\), a decrease of
5.49%. It’s important to note that the estimated
percentage of responses that were unhelpful in our proportional_sample was
6.33%. This is very close to the true population
percentage of 5.49%, and well within our
95% confidence level. One could argue that this
estimate provides circumstantial evidence that our proportional sampling
technique is indeed an accurate representation of the whole
population.
Next let’s remove the Prefer Not to Disclose responses
from the population, and then recalculate Table
3.0 in order to see what changes we’ll have to make to the
proportional_sample we used for estimating our sample
pre-omissions.
Step 2: Calculating Percentage-of-Whole and Proportional Sample Contributions
While I did this already to determine the amount of responses needed
from each location for our proportional sample in Table
3.1, we have to do this again here since we won’t be
using total responses to determine proportional sample
contributions (PSC). Instead, we’ll be using total
workforce to calculate PSC due to altering the response count
by removing responses. However, we can’t easily calculate PSC for all
locations since we’d need accurate historical staff
counts for all company entities and campuses
at the time that the survey was distributed, then
calculate the percentage-of-whole (POW) each location makes-up of the
true population.
In order to calculate POW for all location options, we’ll have to import and calculate average staff count for each location option in the survey, which also comes with several challenges. While we have accurate staff counts for all RESED TX school locations, we cannot say the same for Arkansas locations, Home office departments, Blue-X, or Bright Thinker. Calculating average staff counts for these locations will be difficult as historical staff counts for these entities are rarely kept track of. For the sake of time, I ended up estimating as best I could for the locations where I don’t have easy access to historical staff counts.
Arkansas school totals were estimated using the latest active staff
count data I had access to, which was for September of 2023. There was a
count differential after adding the AR totals to the RESED TX totals,
which was made-up by proportionally adjusting each AR school via their
POW until the Total RES Staff number of \(2820\) first outlined in Section 1.3 was matched. Home office departments
were estimated using EOY Skyward counts for 22-23, and adjusted from the
Blue Learning total since we know that a portion of home office
employees are technically Blue Learning. Blue-X was also estimated from
the BL total, leaving the proportional adjustment of all schools to
account for substitutes as the last step.
This is why we mentioned earlier that there’s a reason we’re using an adjusted staff count in lieu of an actual staff count, as it would be nearly impossible to determine accurate staff counts at the time of the survey. Once the staff counts have been calculated, we’ll use the POW ratio from each location to calculate a new 10% proportional sample contribution of the true population.
Now that we know which locations we need more responses from for a
proportional 10% sample, we’ll export the post-omission
data set to Google Sheets, and randomly select and classify the
responses needed to complete this final sample. Next we can see how this
final optimized sample will differ from the previous
proportional_sample by comparing counts. We’ll also include
the weight adjustment that will be applied to responses
from each location based on their staff count/response differential as
discussed earlier.
The process of weighting doesn’t change the actual count of responses in our final sample, but rather their contribution to the analysis. Take FCA Austin for example: this table shows us that there is one additional response in the final sample compared to the one used before, because FCA Austin makes up a higher POW ratio of total workforce than of total responses. Furthermore, each of the responses from FCA Austin will be weight-adjusted by a value of \(3\) to make up for the difference between the total responses post-omission and the 22-23 adjusted staff count.
Since this final sample represents a refined, optimized version of our original proportional_sample, I decided to add two new topics that weren’t a part of our original 12 (14. favoritism or lack of staff accountability & 15. better benefits besides PTO). At this point I’ve likely read through almost half of the original 1400 responses, so I believe the addition of these last 2 topics should capture the remaining major points of discussion not already a part of our predefined list.
Now that we understand how the responses will be weighted, let’s
calculate the topic metrics using our final_sample and go
over the results.
| Topics Mentioned | Weighted Sample Count | Sample Proportion % | LowerCI % | UpperCI % | Estimated Total | Lower Estimate | Upper Estimate |
|---|---|---|---|---|---|---|---|
| Increase in compensation | 144 | 45.28% | 40.75% | 49.89% | 1438 | 1294 | 1584 |
| Improve communication | 137 | 43.08% | 38.59% | 47.69% | 1368 | 1225 | 1514 |
| Feeling understaffed or overworked | 86 | 27.04% | 23.15% | 31.32% | 859 | 735 | 994 |
|  | 71 | 22.33% | 18.73% | 26.39% | 709 | 595 | 838 |
|  | 67 | 21.07% | 17.56% | 25.07% | 669 | 558 | 796 |
|  | 53 | 16.67% | 13.51% | 20.38% | 529 | 429 | 647 |
|  | 50 | 15.72% | 12.66% | 19.37% | 499 | 402 | 615 |
|  | 36 | 11.32% | 8.72% | 14.58% | 359 | 277 | 463 |
|  | 34 | 10.69% | 8.17% | 13.88% | 339 | 259 | 441 |
|  | 33 | 10.38% | 7.89% | 13.53% | 330 | 251 | 430 |
|  | 30 | 9.43% | 7.07% | 12.48% | 299 | 224 | 396 |
|  | 23 | 7.23% | 5.19% | 10.00% | 230 | 165 | 318 |
|  | 19 | 5.97% | 4.14% | 8.55% | 190 | 131 | 271 |
|  | 14 | 4.40% | 2.86% | 6.71% | 140 | 91 | 213 |
|  | 6 | 1.89% | 0.98% | 3.61% | 60 | 31 | 115 |
Since this sample relies on a great deal of estimation, I thought it would be appropriate to lower the confidence level from 95% to 90% to account for these conditions. Linguistically, this means that 90% of the time, the true topic mention rate in the population will fall between the two values reported for each topic (LowerCI % & UpperCI %). This sample estimation tries to account for individuals that (for one reason or another) were not represented in the survey population we originally pulled from for Table 3.1. It does this by weighting the remaining useful responses according to their campus’s response-to-staff-count ratio, the logic being that sentiment among the staff at each location is more likely to be aligned than not. Of course, there could still be cases where one individual’s response is an outlier, but that’s the case in all sample estimations, and it’s why it’s important to consider this when setting confidence intervals. That being said, these metrics give us more actionable insight as to which issues to address and prioritize. Let’s move onto the final step of this analysis, which is to build action items based on specific comments.
Since the final_sample is technically an estimation,
let’s bring back the proportional_sample summaries so we
can compare it to the refined sample summary done above.
| Variable | Total | Proportion | LowerCI | UpperCI | Lower Estimate | Estimated Total | Upper Estimate |
|---|---|---|---|---|---|---|---|
|  | 41 | 0.1367 | 0.1024 | 0.1802 | 325 | 434 | 572 |
|  | 40 | 0.1333 | 0.0995 | 0.1765 | 316 | 423 | 560 |
|  | 26 | 0.0867 | 0.0598 | 0.1240 | 190 | 275 | 394 |
|  | 19 | 0.0633 | 0.0409 | 0.0968 | 130 | 201 | 307 |
|  | 19 | 0.0633 | 0.0409 | 0.0968 | 130 | 201 | 307 |
|  | 18 | 0.0600 | 0.0383 | 0.0928 | 122 | 190 | 295 |
|  | 17 | 0.0567 | 0.0357 | 0.0889 | 113 | 180 | 282 |
|  | 16 | 0.0533 | 0.0331 | 0.0849 | 105 | 169 | 270 |
|  | 15 | 0.0500 | 0.0305 | 0.0808 | 97 | 159 | 257 |
|  | 12 | 0.0400 | 0.0230 | 0.0686 | 73 | 127 | 218 |
|  | 10 | 0.0333 | 0.0182 | 0.0603 | 58 | 106 | 191 |
|  | 5 | 0.0167 | 0.0071 | 0.0384 | 23 | 53 | 122 |
|  | 3 | 0.0100 | 0.0034 | 0.0290 | 11 | 32 | 92 |
|  | 2 | 0.0067 | 0.0018 | 0.0240 | 6 | 21 | 76 |
While the counts themselves are a lot higher in Table 3.7 due to the
weight-adjustments, the order of the most frequently mentioned topics
remain largely unchanged between our original sample and the refined
one. The #1 most mentioned topic among all staff is
Increase in compensation, #2 is
Improve communication, and #3 is
Feeling understaffed or overworked. Let’s take a closer
look at each of these to draw closing arguments and action items.
Most Mentioned Topic: Increase in Compensation
To get a better idea of the specific things people said in regards to
Increase in Compensation, let’s create a table that shows
all the responses that mentioned this topic in our
final_sample.
The majority of responses don’t elaborate past outright stating “increase pay”, but there are a couple of mentions of matching compensation to that of local ISDs, doing regular salary reviews, and cost of living adjustments. In a perfect world, we would just increase everyone’s salary by 10% and everyone would be happy, but that’s just not feasible at this time. Some low-cost solutions recommended by Casey Morgan are to:
Both of these suggestions could be completed within a relatively short-timeline and would show progress towards fulfilling an increase in compensation.
Because these solutions don’t immediately address the compensation concerns, I recommend being transparent with our workforce as to why we’re not currently able to do so. Transparency would go a long way here, and it would be an easy way to improve communication which also happens to be the second most mentioned topic. Explaining the reasons why we’re not able to immediately increase compensation could stifle any frustration potentially brewing within the workforce population.
2nd Most Mentioned Topic: Improve communication
Now let’s look at the responses from our final_sample
that mention improving communication.
The responses here vary similarly to the previous example for increasing compensation, but defining clear solutions to specific communication problems is more difficult without more details. There are several responses that mention better communication surrounding new system/product implementations, and other responses that mention proactive communication regarding staff leaves. While these are certainly important grievances, there aren’t really any definitive ways to improve communication channels regarding these specific issues at the campus level. This question serves as a good example for targeted questions we could ask our workforce in future surveys, for example:
As for some definitive solutions, Casey recommends:
These ideas would be fairly straight-forward to implement, and are actual recommendations sourced from survey responses.
3rd Most Mentioned Topic: Feeling understaffed or overworked
Lastly, let’s look at responses that mention feeling understaffed or overworked.
Of the top 3 most mentioned topics in our final_sample, responses that mention Topic 1 vary the most in language and sub-topic. In other words, the subjectivity involved in interpreting topics is most evident with these responses, and we have a clearer image as to why our machine-learning model performed so poorly. The majority of responses mention hiring additional support in the form of counselors, classroom aides, and campus security.
Some recommendations provided by Casey Morgan include:
The last of which is actually already being implemented as part of an HR Operations project led by Pete Alonzo. Adding additional sick days and implementing a school marshal program would also address other frequently mentioned topics in the survey responses (“More PTO” & “Student disciplinary system issues” respectively). Ultimately, regardless of the changes we make as a result of this analysis, we’ve learned that satisfaction surveys not only give valuable insight into the general sentiment of our workforce, but also provide an opportunity for employees to voice concerns and contribute to the operation of our organization as a whole.